Preprocess event data for analysis

by Alberto Diaz-Durana

adiazdurana@gmail.com

Date: 08.02.2021

First we will import the file Example MIX TYPES.txt and have a look at what we have.

Task 1: Understanding the data

The eLetter_ID is the case identifier and the ActivityName is the activity.

Question: How many events do we have?

According to the description given in the assignment, each event has to have at least a case identifier (TransID in the data), the name of the activity that was executed (ActivityName), and a start and a complete timestamp (which will be calculated later in this notebook). Optionally, an event can have an arbitrary number of attributes (variables or columns in the DataFrame).

Question: How many events do we have for each activity name?

To obtain the number of events for each activity name we have to count the number of observations of case identifiers for each ActivityName.
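A minimal sketch of this count, assuming the file has been loaded into a DataFrame with columns TransID and ActivityName (the sample values here are hypothetical, not taken from the real file):

```python
import pandas as pd

# Hypothetical sample mirroring the structure of the raw file:
# each (TransID, ActivityName) pair appears twice (start and complete rows).
df = pd.DataFrame({
    "TransID": [1, 1, 1, 1, 2, 2],
    "ActivityName": ["Receive", "Receive", "Scan", "Scan", "Receive", "Receive"],
})

# Number of events per activity name: count the rows for each ActivityName.
events_per_activity = df["ActivityName"].value_counts()
print(events_per_activity)
```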

Question: How many cases do we have?

The number of cases is the number of unique case identifiers. Notice also that the case identifiers run from 1 to 151, so an alternative solution is to find the highest number in this column.
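Both approaches can be sketched as follows (the TransID values below are hypothetical; in the real data they run from 1 to 151):

```python
import pandas as pd

# Hypothetical case-identifier column with consecutive IDs starting at 1.
df = pd.DataFrame({"TransID": [1, 1, 2, 2, 3, 3]})

n_cases = df["TransID"].nunique()   # count the unique case identifiers
n_cases_alt = df["TransID"].max()   # works because the IDs are consecutive from 1
print(n_cases, n_cases_alt)
```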

Question: There’s an ​eLetter_Type​ attribute. What values does it have and how often do they occur?

Task 2: Transforming the data into an event log

Calculating the timestamps for each row

Our client provided the fixed point in time indicating the timestamp at which everything begins (the genesis 😄). It is 01.01.2016 10:00:00.

The variable 'timeStamp' is obtained by adding the offset stored in the attribute 'Time' to the variable fixPointTime.
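A minimal sketch of this calculation, assuming the 'Time' attribute holds an offset in seconds (the unit and sample values are assumptions for illustration):

```python
import pandas as pd

# fixPointTime given by the client: 01.01.2016 10:00:00
fixPointTime = pd.Timestamp("2016-01-01 10:00:00")

# Hypothetical 'Time' attribute: an offset (here assumed seconds) from fixPointTime.
df = pd.DataFrame({"Time": [0, 60, 3600]})

# timeStamp = fixPointTime + offset
df["timeStamp"] = fixPointTime + pd.to_timedelta(df["Time"], unit="s")
print(df)
```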

Combining two subsequent rows for the same activity into one with start and complete timestamp

A first step in preparing the event log is to calculate the relative time of each event, i.e. the time at which the event occurs with respect to the beginning of the process.

Now we inspect the DataFrame to check on the new start and end timestamps.

Notice that the values in ActivityName are still doubled for each value in TransID. The next step is to eliminate the duplicates, leaving the rows showing the relative time relativeTime_s.
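The pairing of the two rows can be sketched like this, assuming each activity occurs once per case so that the earlier row is the start and the later one the complete timestamp (all sample values are hypothetical):

```python
import pandas as pd

# Hypothetical long-format log: two rows per (TransID, ActivityName),
# the first being the start and the second the complete timestamp.
df = pd.DataFrame({
    "TransID": [1, 1, 1, 1],
    "ActivityName": ["Receive", "Receive", "Scan", "Scan"],
    "timeStamp": pd.to_datetime([
        "2016-01-01 10:00:00", "2016-01-01 10:05:00",
        "2016-01-01 10:05:00", "2016-01-01 10:20:00",
    ]),
})

# Collapse each pair into one event with a start and a complete timestamp.
events = (df.groupby(["TransID", "ActivityName"], sort=False)["timeStamp"]
            .agg(start="min", complete="max")
            .reset_index())
print(events)
```

Note that min/max pairing only works when an activity does not repeat within a case; repeated activities would need positional pairing instead.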

Task 3: Checking the transformed data

Answer the questions from section I again for the transformed data

Question: How many events do we have?

Question: How many events do we have for each activity name?

We count the number of observations of case identifiers for each ActivityName after the transformation.

Question: How many cases do we have?

Question: There’s an ​eLetter_Type​ attribute. What values does it have and how often do they occur?

Calculate the difference between start and complete time for each event and show the distribution

We can verify that the values in caseStart and in caseEnd were obtained by adding the values in Time to the variable fixPointTime by recalculating the relative time between events.
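The duration per event and its distribution can be sketched as follows (column names and values are hypothetical stand-ins for the transformed log):

```python
import pandas as pd

# Hypothetical transformed log with start/complete timestamps per event.
events = pd.DataFrame({
    "start": pd.to_datetime(["2016-01-01 10:00:00", "2016-01-01 10:05:00"]),
    "complete": pd.to_datetime(["2016-01-01 10:05:00", "2016-01-01 10:20:00"]),
})

# Duration of each event in seconds; describe() summarises the distribution,
# and events["duration_s"].hist() would plot it.
events["duration_s"] = (events["complete"] - events["start"]).dt.total_seconds()
print(events["duration_s"].describe())
```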

Notice that

(Optional): Surprise us with something interesting you found in the data :-)

Filtering events

One final thing we will want to look at is which events are shared by all cases and which are not, since in process mining it is the non-shared, differentiating events that we are interested in.
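One way to sketch this split is to compare, per activity, the number of cases it appears in against the total number of cases (sample values below are hypothetical):

```python
import pandas as pd

# Hypothetical log: 'Receive' occurs in every case, 'Inspect' does not.
events = pd.DataFrame({
    "TransID": [1, 1, 2, 2, 3],
    "ActivityName": ["Receive", "Inspect", "Receive", "Inspect", "Receive"],
})

n_cases = events["TransID"].nunique()
# An activity is shared if it appears in every case.
cases_per_activity = events.groupby("ActivityName")["TransID"].nunique()
shared = cases_per_activity[cases_per_activity == n_cases].index.tolist()
differentiating = cases_per_activity[cases_per_activity < n_cases].index.tolist()
print(shared, differentiating)
```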

(Super Bonus): Penguins 🐧 in Twitter!

The selected package is twint (https://github.com/twintproject/twint), used to scrape the tweets and store them in a .json file for each day the process has been running.

To install the package and be able to run it in this notebook, we will clone the repository as explained in the URL above.

Scrapin' Twitter (this could take about 4 hours!)

Uncomment the cell in this subchapter if you would like to run the entire process of scraping the tweets and placing them in a folder called "penguin". The tweets scraped on a given day will be stored in a .json file with the date as filename.

The cell below creates several functions to automate the process of searching over several days and storing each day's results as a distinct .json file: twint_loop splits the date range into a series of days and calls twint_search to do the searching for each date. Each .json file is named after the date and stored in a directory based on the search term, using clean_name to ensure that it is a valid directory name.
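The date-splitting and name-cleaning logic can be sketched as below. The helper signatures here are assumptions made for illustration; the actual twint configuration (c.Since, c.Until, c.Store_json, c.Output, run via twint.run.Search) needs network access and is only described in comments:

```python
from datetime import date, timedelta
import re

def clean_name(term):
    # Keep only characters that are safe in a directory name
    # (hypothetical helper mirroring the notebook's clean_name).
    return re.sub(r"[^A-Za-z0-9_-]", "_", term)

def day_range(since, until):
    # Yield (day, next_day) pairs covering [since, until) one day at a time,
    # matching how twint_loop splits the date range.
    day = since
    while day < until:
        nxt = day + timedelta(days=1)
        yield day, nxt
        day = nxt

# Each (day, nxt) pair would be passed to twint_search, which configures a
# twint.Config() with Since/Until set to the pair and Output set to the
# clean_name directory; that network-bound part is omitted here.
days = list(day_range(date(2016, 1, 1), date(2016, 1, 4)))
print(days)
```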

Downloading a csv with the scraped tweets (a faster workaround!)

Comment out the cell in this subchapter if you would like to run the cells above and see how Twitter gets scraped!

If you decide to use this workaround, a csv file will be downloaded from a URL and stored in a DataFrame.

Sentiment analysis

The next cell takes about 5 minutes... So let's go get a cup of tea ☕

These polarity values can be plotted in a histogram, which can help to highlight the overall sentiment (i.e. more positivity or negativity) toward the subject.

This plot displays a revised histogram of polarity values for tweets on penguins. For this histogram polarity values equal to zero have been removed to better highlight the distribution of polarity values.
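The zero-removal step can be sketched as follows (the polarity values below are hypothetical; real values come from the sentiment analysis above):

```python
import numpy as np

# Hypothetical polarity scores in [-1, 1] from the sentiment analysis step.
polarity = np.array([0.0, 0.0, 0.3, -0.2, 0.5, 0.0, -0.1])

# Drop the zero (neutral) values so the histogram highlights the
# distribution of the remaining positive and negative scores;
# plt.hist(nonzero, bins=4, range=(-1, 1)) would draw the revised plot.
nonzero = polarity[polarity != 0]
counts, edges = np.histogram(nonzero, bins=4, range=(-1, 1))
print(nonzero, counts)
```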

Frequency analysis

Let's now have a look at the frequency of mentions per date according to the DataFrame we have just generated.

Distribution of penguin counts after assigning to events

(Just curious) Visualizing the process with pm4py

You can then represent this model as a Petri net and visualise it with the pm4py visualizer object from pm4py.visualization.petrinet.

Before applying one of the many process mining algorithms, it is informative to get some numbers describing our log and process. We will start by understanding: how many variants do we have, and how many cases are in each variant?

A process variant is a unique path from the very beginning to the very end of the process.
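Counting variants and cases per variant can be sketched directly on the DataFrame, without pm4py, assuming the events are sorted by start timestamp within each case (sample values are hypothetical):

```python
import pandas as pd

# Hypothetical event log, already sorted by start timestamp within each case.
events = pd.DataFrame({
    "TransID": [1, 1, 2, 2, 3, 3],
    "ActivityName": ["Receive", "Scan", "Receive", "Scan", "Receive", "Archive"],
})

# A variant is the ordered sequence of activities of a case,
# encoded here as a "a -> b -> c" string per case.
variants = events.groupby("TransID")["ActivityName"].agg(" -> ".join)
cases_per_variant = variants.value_counts()
print(cases_per_variant)
```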

A few activities stand out because they have a lot of actions; this could be some sort of self-loop, rework, or something else.

Alpha Miner

The starting point for the Alpha algorithm is the ordering relations (sorted by timestamp, of course). So, we do not consider frequencies, nor do we consider other attributes!

DFG - Directly-follows graph with frequency and time between the edges
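The edge frequencies of a DFG can be sketched by counting, per case, each pair of consecutive activities (sample values are hypothetical; pm4py's dfg_discovery would do the same on the real log):

```python
import pandas as pd
from collections import Counter

# Hypothetical log, sorted by timestamp within each case.
events = pd.DataFrame({
    "TransID": [1, 1, 1, 2, 2],
    "ActivityName": ["Receive", "Scan", "Archive", "Receive", "Scan"],
})

# A directly-follows pair (a, b) means activity b immediately follows a
# within the same case; the DFG edge weight is the pair's frequency.
dfg = Counter()
for _, trace in events.groupby("TransID"):
    acts = trace["ActivityName"].tolist()
    dfg.update(zip(acts, acts[1:]))
print(dict(dfg))
```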